Abstract

Marginalized models are in great demand among researchers in the life sciences, particularly
in clinical trials, epidemiology, health economics, surveys and many other areas, since they allow
generalization of inference to the entire population under study. For count data, standard procedures
such as Poisson regression and the negative binomial model provide population-average inference
for model parameters. However, the occurrence of excess zero counts and lack of independence
in empirical data have necessitated their extension to accommodate these phenomena. These
extensions, though useful, complicate the interpretation of effects. For example, the zero-inflated
Poisson model accounts for the presence of excess zeros, but its parameter estimates do not
have the direct marginal interpretation of those of its base model, the Poisson model. Marginalizations
due to the presence of excess zeros are underdeveloped, although demand for them is
high. The aim of this paper is to develop a marginalized model for zero-inflated univariate count
outcomes in the presence of overdispersion. Emphasis is placed on methodological development,
efficient estimation of model parameters, implementation and application to two empirical studies.
A simulation study is performed to assess the performance of the model. Results from the
analysis of two case studies indicate that the refined procedure performs significantly better, in
terms of likelihood comparisons and AIC values, than models which do not simultaneously correct
for overdispersion and the presence of excess zero counts. The simulation studies support these
findings. In addition, the proposed technique yielded small biases and mean square errors for
model parameters. To ensure that the proposed method enjoys widespread use, it is implemented
using the SAS NLMIXED procedure with minimal coding effort.

Keywords: Marginal model; Maximum likelihood estimation; Negative binomial; Overdispersion;
Poisson model; Zero-inflation.


1 Introduction 

Studies involving count data are widespread. They can be found in contemporary research areas
such as clinical trials, epidemiological studies, health economics, surveys and other experiments in
biopharmaceutical research and bioinformatics. When the response of interest is of count type, Poisson
regression, which is placed within the generalized linear modeling (GLM) framework (Nelder and
Wedderburn 1972, McCullagh and Nelder 1989, Agresti 2002), is routinely used to model the effect
of covariates on the observed counts.

The most efficient way to make reliable inferences from well designed and executed studies is to
choose an appropriate statistical model which reflects not only the design of the study but also
certain characteristics of the data. The Poisson regression, though popular, fails to address certain
attributes of the data and key design features, and this has led to several extensions. In the presence
of many zero counts, especially in studies that involve 'rare' events, the Poisson regression is far
from optimal. The zero-inflated Poisson (ZIP) model has been proposed to model count data with
excessive zeros. Related to the presence of excess zeros is the phenomenon called overdispersion.
The Poisson distribution, a member of the exponential family of distributions, is noted for having
a strict mean-variance relationship which is often inadequate to capture the variability inherent in
empirical data. In order to allow for inflation of the variance of the outcome, the negative binomial
(NB) model has been developed and applied in many studies. Underdispersion is also possible but
rarely encountered. For underdispersed data, the generalized Poisson or perhaps the hurdle model
is used. A broad overview of models and estimation methods for overdispersed data can be found
in Hinde and Demétrio (1998a, b). As can be expected, excess zeros and overdispersion do occur
together in practice. The zero-inflated negative binomial (ZINB) model is routinely used to handle
both simultaneously and has also been implemented in several studies (Sheu and Liang, 1987).

Despite the useful extensions made to improve the Poisson regression in the presence of many
zeros and overdispersion, interpretation of model parameters is hampered. Precisely, the marginal
interpretation of effects of explanatory variables on the response is lost. Instead, the parameters have
a latent class interpretation. This is because the ZI models assume separate models for the process
generating the excess zeros and the process generating the counts. The implication is that different
sets of parameters are associated with a subpopulation of at-risk or susceptible subjects and a
subpopulation of not-at-risk or non-susceptible subjects, and hence inference targeted at the entire
population is difficult to obtain. An approach for obtaining marginal inference is therefore required
for count data with excess zeros.

Heagerty (1999) introduced a technique that does not alter the marginal interpretation of model
parameters when normal random effects are employed to correct for lack of independence in
longitudinal binary outcomes. This marginalized multilevel model (MMM) defines separately a marginal
mean model and a conditional mean model, and the two models are held together by a so-called
connector function. Iddi and Molenberghs (2012; 2013) extended this marginalized model to
accommodate overdispersion (COMMM model) in the presence of subject-specific random effects. Lee
et al (2011) also proposed an extension of the MMM to zero-inflated clustered count data using
the hurdle model (Mullahy 1986). The form of marginalization considered by these authors is over
so-called subject-specific random effects, used to handle association in longitudinal or clustered data.
Long et al (2014) adapted these ideas and proposed a marginalized model that estimates overall
exposure effects in the ZIP model for univariate count outcomes. The marginalized zero-inflated
Poisson model (MZIP) eliminates the latent class interpretation of regression coefficients in the
traditional ZIP model and instead allows for exposure effects on the entire population under study.
However, this model is not suited for univariate count data exhibiting overdispersion. Therefore, this
paper aims to refine the MZIP model to handle marginalization in the presence of excess zeros and also
to encompass overdispersion, due to unobserved heterogeneity, that naturally occurs with count data.
The modeling framework envisaged takes into account these attributes of the data and permits
population-average inference on model quantities. This supports efficient estimation of model
parameters and proper statistical inference, leading to valid research conclusions for
policy decisions and recommendations. It will also help solve interpretation and implementation
challenges faced by many applied analysts.

The rest of the paper is organized as follows. Section 2 introduces two
motivating datasets; these are analyzed in Section 5. A review of existing methodology is provided
and the refined technique presented in Section 3. The maximum likelihood estimation strategy used
for fitting the models is discussed in Section 4. Simulation results are
discussed in Section 6. The paper concludes with final remarks in Section 7.

2 Case Studies 

The main purpose of this section is to present two case studies used to illustrate the proposed
methodology and how it compares with existing methods. The data resulting from these studies
exhibit both overdispersion and zero-inflated counts, the attributes addressed by the proposed
technique. These studies are described in turn.

2.1 A Clinical Trial in Epileptic Patients 

The data are from a randomized, double-blind, parallel group, multi-center study for the comparison
of placebo with a new anti-epileptic drug (AED), in combination with one or two other AEDs
(Faught et al 1996). Patients were randomized after a 12-week stabilization period for the use of
AEDs, during which the number of seizures was counted. After that run-in period, 45 patients
were assigned to the placebo group and 44 to the new treatment. Patients were measured weekly and
followed (double-blind) during 16 weeks; thereafter they entered a long-term open-extension study.
Some patients were followed for up to 27 weeks. The outcome of interest is the number of epileptic
seizures experienced during the latest week, i.e., since the last time the outcome was measured. The
research question is whether or not the additional new treatment reduces the number of epileptic
seizures.

In the accompanying figure, a histogram of the number of epileptic seizures shows a high proportion
of zero counts, accounting for about 33% of the data. Also, simple descriptive statistics show a very
high variance of 37.70, as compared to the empirical mean of 3.18, an indication of overdispersion.
These data are therefore suited for illustrating the proposed model.

2.2 The Whitefly Study 

These data resulted from a horticultural experiment designed to examine the effect of six methods of
applying the insecticide imidacloprid to poinsettia plants. The data have previously been reported by van

The summary figure shows that, at every level of treatment and block, the variance is always higher
than the mean, reflecting overdispersion. The histogram reveals a high occurrence of zero
immature whiteflies, which cannot be accounted for by the variance function of a Poisson or negative
binomial distribution. It therefore seems sensible to apply the proposed technique.


3 Methodology 

3.1 Notation 

Let $Y_i$ be the number of counts for an independent subject $i$ ($i = 1, 2, \ldots, n$). Assume that,
together with the response, a set of regressors is recorded for each subject, denoted $X_{ij}$ for
$j = 1, 2, \ldots, p$, where $p$ is the number of explanatory variables. Another set of covariates is
represented as $Z_{ik}$, which is a subset of $X_{ij}$ with $k \le j$. In vector notation, the covariates
are written as $X_i$ and $Z_i$ for the $i$th individual. Also, let the marginal and conditional expected
counts be denoted by $\mu_i$ and $\lambda_i$ respectively.

3.2 Background Methodology


For independent count outcomes, the commonly used technique to evaluate the effect of explanatory
variables on the response is the traditional Poisson regression. The response variable $Y_i$ is assumed
to follow the Poisson distribution with mean $\mu_i$. The marginal mean is regressed on a set of
covariates $X_i$ using a log link. Thus,

\[ Y_i \sim \text{Poisson}(\mu_i) \]

and

\[ \log(\mu_i) = X_i'\beta, \]

where $\beta$ is a vector of parameters associated with the vector of covariates, $X_i$. The relationship
between the response and the set of predictors is thus captured by $\beta$.


The Poisson regression model assumes, in its simplest form, that the marginal mean and variance
of the response are equal. This strong assumption, often not tenable for empirical data due to
heterogeneity introduced in the data when important covariates are omitted from the study, is
relaxed by applying an overdispersed model. A commonly used overdispersed model is the negative
binomial regression model. It assumes that the counts follow a Poisson distribution with conditional
mean $\lambda_i$. This mean is in turn allowed to follow the gamma distribution with shape and scale
parameters $a$ and $b$ respectively. The resulting marginal distribution is the negative binomial
distribution with density

\[ f(y_i) = \frac{\Gamma(a + y_i)}{\Gamma(a)\, y_i!} \left(\frac{b}{1 + b}\right)^{y_i} \left(\frac{1}{1 + b}\right)^{a}. \]

The first two marginal moments, using iterated expectations, are given respectively by

\[ E(Y_i) = E\{E(Y_i \mid \lambda_i)\} = ab = \mu_i, \]

\[ \text{Var}(Y_i) = \text{Var}\{E(Y_i \mid \lambda_i)\} + E\{\text{Var}(Y_i \mid \lambda_i)\} = \mu_i(1 + k\mu_i). \]

The parameter $k = 1/a$ is called the overdispersion parameter. When $k = 0$, the model reduces to the
Poisson model. Since $k > 0$, the model only captures overdispersion and hence cannot be
used to model underdispersion. The NB regression model relates observed predictors to observed
counts by taking $\log(\mu_i) = X_i'\beta$.
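
The moment formulas above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the negative binomial density directly and sums it to recover the mean $ab$ and the variance $\mu_i(1 + k\mu_i)$ with $k = 1/a$. The values of `a` and `b` and the truncation point `y_max` are illustrative assumptions.

```python
import math

def nb_pmf(y, a, b):
    """Negative binomial pmf from a Poisson mean mixed over Gamma(shape=a, scale=b)."""
    log_p = (math.lgamma(a + y) - math.lgamma(a) - math.lgamma(y + 1)
             + y * math.log(b / (1.0 + b)) - a * math.log(1.0 + b))
    return math.exp(log_p)

def nb_moments(a, b, y_max=2000):
    """Mean and variance by direct summation of the pmf, truncated at y_max."""
    probs = [nb_pmf(y, a, b) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

a, b = 2.0, 1.5            # illustrative shape and scale (assumed values)
mu, k = a * b, 1.0 / a     # marginal mean and overdispersion parameter
mean, var = nb_moments(a, b)
print(mean, mu)                  # both close to 3.0
print(var, mu * (1 + k * mu))    # both close to 7.5
```
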

Next, the Poisson and NB models assume that zero and non-zero counts are generated from the
same mechanism. However, in the presence of an excessive number of zero counts, which occurs mostly
for rare events, these models are not optimal because they are unable to accommodate
the extra dispersion due to the presence of zeros. The zero-inflated Poisson (ZIP) model has been
proposed to address this issue. The model assumes that counts are generated by two processes. The
first process generates zero counts with probability $\pi_i$, while the second generates counts from
the Poisson distribution with parameter $\lambda_i$ and is realized with probability $(1 - \pi_i)$. Zero
counts therefore arise from two sources, according to the probabilities of the two processes. Thus,

\[ Y_i = \begin{cases} 0 & \text{with probability } \pi_i + (1 - \pi_i)e^{-\lambda_i}, \\ y_i & \text{with probability } (1 - \pi_i)\dfrac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad y_i \in \mathbb{Z}^{+}. \end{cases} \]

The first two moments for the ZIP model are given by

\[ E(Y_i) = (1 - \pi_i)\lambda_i = \mu_i, \]

\[ \text{Var}(Y_i) = \lambda_i(1 - \pi_i)(1 + \pi_i\lambda_i). \]

This reduces to the Poisson model when $\pi_i = 0$. Note that the variance depends on the probability
of zeros, $\pi_i$: the variance-to-mean ratio, $1 + \pi_i\lambda_i$, increases with $\pi_i$, and the model
thus accommodates greater dispersion in the data. To fit this model, the logistic regression model is
used to model the probability $\pi_i$ of zero counts and the log-linear Poisson($\lambda_i$) model for
the count process. For vectors of covariates $Z_i$ and $X_i$ with their associated vectors of parameters
$\alpha$ and $\beta$ respectively, the model specifications are as follows:

\[ \text{logit}(\pi_i) = Z_i'\alpha \]

and

\[ \log(\lambda_i) = X_i'\beta. \]
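
As a quick illustration of the ZIP specification, the sketch below evaluates the ZIP probabilities for one subject and confirms by direct summation that the mean is $(1 - \pi_i)\lambda_i$ and the variance is $\lambda_i(1 - \pi_i)(1 + \pi_i\lambda_i)$. The values of `pi` and `lam` are assumptions chosen only for illustration.

```python
import math

def zip_pmf(y, pi, lam):
    """ZIP probability of count y with zero-inflation probability pi and Poisson mean lam."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return pi + (1.0 - pi) * poisson   # structural zeros plus Poisson zeros
    return (1.0 - pi) * poisson

pi, lam = 0.3, 4.0   # illustrative values (assumed)
probs = [zip_pmf(y, pi, lam) for y in range(200)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(sum(probs))                                    # close to 1
print(mean, (1 - pi) * lam)                          # both close to 2.8
print(var, lam * (1 - pi) * (1 + pi * lam))          # both close to 6.16
```
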


3.3 Marginal Effects and Incidence Density Ratio 


Marginal effects allow us to generalize the effects of predictors on the response variable to the
entire population under consideration. Such inferences are based on the parameters associated with
the predictors. In the traditional Poisson or negative binomial models, the regression coefficients
are interpreted in terms of the differences in the logs of the expected counts for a unit change in
the predictor variables, or as the log of the ratio of expected counts. Equivalently, the models are
interpreted in terms of the incidence density (rate) ratio (IDR), obtained by exponentiating the
regression estimates. Let $\mu_{ij}$ and $x_{ij}$ be the mean and the $j$th predictor variable evaluated
at $j$ respectively. Also, let $X_{i(j)}$ be a vector of predictors where the $j$th variable has been
removed from $X_i$, with associated vector of regression coefficients $\beta_{(j)}$. Then the IDR, the
ratio of the marginal expected means for a unit increase in the predictor variable $x_{ij}$, is given by

\[ \text{IDR} = \frac{E(Y_i \mid x_{ij} = j + 1)}{E(Y_i \mid x_{ij} = j)} = \frac{\exp\{(j + 1)\beta_j + X_{i(j)}'\beta_{(j)}\}}{\exp\{j\beta_j + X_{i(j)}'\beta_{(j)}\}} = \exp(\beta_j), \]

where $\beta_j$ is the parameter associated with the $j$th predictor. This ratio is thus constant over the
various levels of all other predictors in the regression model.

For the zero-inflated models, the marginal mean $\mu_i$ is of the form

\[ \mu_i = E(Y_i) = \frac{\exp(X_i'\beta)}{1 + \exp(Z_i'\alpha)}. \]

This depends on parameters associated with predictors in both components of the zero-inflated
model. Assume that the same predictors $X_i$ are used in both the logistic and log-linear parts of the
model; then the IDR is expressed as

\[ \frac{\mu_{i,j+1}}{\mu_{ij}} = \exp(\beta_j)\, \frac{1 + \exp\{j\alpha_j + X_{i(j)}'\alpha_{(j)}\}}{1 + \exp\{(j + 1)\alpha_j + X_{i(j)}'\alpha_{(j)}\}}, \]

where $X_{i(j)}$ and $\alpha_{(j)}$ denote the predictors and logistic coefficients with the $j$th element
removed. Unlike the Poisson and NB models, the IDR varies across the levels of the predictors in
the logistic part of the zero-inflated model. Only when $\alpha_j = 0$ is the IDR constant across the
levels of the extraneous predictors. Thus one has to employ a summary measure to obtain a single
IDR for a given predictor in the presence of the other predictors. Next, estimates of
the variability of the IDR are obtained using the delta method or bootstrap resampling techniques.
However, implementation of these techniques is cumbersome and requires additional computational
effort since they are not readily available in standard software packages.
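
A small numerical sketch makes the non-constant IDR concrete. It computes the ratio of zero-inflated marginal means for a one-unit change in a binary predictor at two levels of a second predictor. The coefficient values are borrowed from the simulation settings of Section 6.1 purely for illustration; the function names are hypothetical.

```python
import math

def zi_marginal_mean(x, alpha, beta):
    """Marginal mean exp(x'beta) / (1 + exp(x'alpha)) of a zero-inflated model."""
    eta_count = sum(b * v for b, v in zip(beta, x))
    eta_zero = sum(a * v for a, v in zip(alpha, x))
    return math.exp(eta_count) / (1.0 + math.exp(eta_zero))

def idr(x2, alpha, beta):
    """Ratio of marginal means when binary x1 moves from 0 to 1, at a given x2."""
    return (zi_marginal_mean([1.0, 1.0, x2], alpha, beta)
            / zi_marginal_mean([1.0, 0.0, x2], alpha, beta))

alpha = [0.6, -2.0, 0.25]   # (intercept, x1, x2) in the logistic part (illustrative)
beta = [0.25, 0.4, 0.25]    # (intercept, x1, x2) in the count part (illustrative)

# The IDR for x1 changes with the level of x2 ...
print(idr(0.0, alpha, beta), idr(2.0, alpha, beta))

# ... and is constant, equal to exp(beta_1), only when the logistic coefficient of x1 is 0.
alpha_null = [0.6, 0.0, 0.25]
print(idr(0.0, alpha_null, beta), idr(2.0, alpha_null, beta), math.exp(0.4))
```
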


A recent development by Long et al (2014) allows analysts to fit a zero-inflated model with marginal
effects of explanatory variables on the expected counts. The model also admits a constant IDR for a
given covariate in the presence of other predictors in both the logistic and the other component of
the model. This model is reviewed in the next section, and an extension to this procedure is proposed
to accommodate overdispersion.


3.4 Proposed Methodology 

Long et al (2014) proposed an easy alternative to estimate overall exposure effects in a zero-inflated
Poisson model. Instead of relating the mean of the process generating the positive counts, or the
Poisson mean, $\lambda_i$, to predictors using the log link, they expressed the marginal mean, $\mu_i$, in
terms of predictors. The detailed model specifications are as follows: $\text{logit}(\pi_i) = Z_i'\alpha$,
$\log(\lambda_i) = \delta_i$ and $\log(\mu_i) = X_i'\beta$, where $\delta_i$ is an unknown function to be
determined from

\[ \mu_i = (1 - \pi_i)\lambda_i. \]

After substituting the various expressions and solving for $\delta_i$, we obtain

\[ \delta_i = X_i'\beta + \log(1 + \exp(Z_i'\alpha)). \]

The likelihood function is then modified based on these new expressions and is presented in Section 4.
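
The connector expression follows because $1 - \pi_i = 1/(1 + \exp(Z_i'\alpha))$, so taking logs of $\mu_i = (1 - \pi_i)\lambda_i$ gives $\delta_i = \log(\mu_i) - \log(1 - \pi_i)$. The sketch below verifies this numerically for a single subject; the covariate and coefficient values and the function name `mzip_components` are assumptions made for illustration only.

```python
import math

def mzip_components(x, z, alpha, beta):
    """Return (pi_i, lambda_i, mu_i) for one subject under the MZIP parameterization."""
    eta_mu = sum(b * v for b, v in zip(beta, x))     # X_i' beta = log(mu_i)
    eta_pi = sum(a * v for a, v in zip(alpha, z))    # Z_i' alpha = logit(pi_i)
    pi = 1.0 / (1.0 + math.exp(-eta_pi))
    delta = eta_mu + math.log1p(math.exp(eta_pi))    # connector: log(lambda_i)
    return pi, math.exp(delta), math.exp(eta_mu)

# Illustrative covariates and coefficients (assumed values, not taken from the paper)
pi, lam, mu = mzip_components(x=[1.0, 1.0, 0.5], z=[1.0, 0.5],
                              alpha=[0.6, 0.25], beta=[0.25, 0.4, 0.25])

# The zero-inflated mean (1 - pi) * lambda recovers the marginal mean exp(X'beta).
print((1.0 - pi) * lam, mu)
```
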

A marginal zero-inflated negative binomial model (MZINB), an extension of the MZIP model, is
proposed to estimate the marginal effects of predictors on the response. An added advantage
of this extension is that the model is able to correct for overdispersion due to the presence of
both inflated zero counts and heterogeneity due to the absence of important predictors in the model.
The latter is not addressed by the MZIP model.

The MZINB is based on the zero-inflated negative binomial (ZINB) model. Let

\[ Y_i = \begin{cases} 0 & \text{with probability } \pi_i + (1 - \pi_i)\,p_i^{1/k}, \\ y_i & \text{with probability } (1 - \pi_i)\dfrac{\Gamma(y_i + 1/k)}{y_i!\,\Gamma(1/k)}(1 - p_i)^{y_i}\,p_i^{1/k}, \quad y_i \in \mathbb{Z}^{+}, \end{cases} \]

where $p_i = 1/(1 + k\lambda_i)$. The marginal mean $\mu_i$ is similar to the mean from the ZIP model.
However, the variance, which depends on the overdispersion parameter $k$ and $\pi_i$, takes the form

\[ \text{Var}(Y_i) = \lambda_i(1 - \pi_i)\{1 + \lambda_i(k + \pi_i)\}. \]

Thus, the model flexibly accounts for overdispersion due to the presence of excess zeros and
heterogeneity due to omitted important explanatory variables.

To fit the MZINB model, we take $\mu_i = \exp(X_i'\beta)$ as opposed to $\lambda_i = \exp(X_i'\beta)$ in the
traditional ZINB. Also, the mean of the negative binomial component takes the form
$\lambda_i = \exp(\delta_i)$. The expression for $\delta_i$ is similar to that of the MZIP model. The
difference between the two procedures is clearer in their likelihood specifications, as discussed in
Section 4.
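
The ZINB mean and variance can be checked in the same way as for the ZIP model. The sketch below, under assumed values of `pi`, `lam` and `k`, sums the ZINB probabilities with $p_i = 1/(1 + k\lambda_i)$ and compares the result with $(1 - \pi_i)\lambda_i$ and $\lambda_i(1 - \pi_i)\{1 + \lambda_i(k + \pi_i)\}$.

```python
import math

def zinb_pmf(y, pi, lam, k):
    """ZINB probability of count y; the NB component has mean lam and overdispersion k."""
    p = 1.0 / (1.0 + k * lam)
    log_nb = (math.lgamma(y + 1.0 / k) - math.lgamma(1.0 / k) - math.lgamma(y + 1)
              + y * math.log(1.0 - p) + math.log(p) / k)
    nb = math.exp(log_nb)
    if y == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb

pi, lam, k = 0.3, 4.0, 0.5   # illustrative values (assumed)
probs = [zinb_pmf(y, pi, lam, k) for y in range(1000)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(mean, (1 - pi) * lam)                              # both close to 2.8
print(var, lam * (1 - pi) * (1 + lam * (k + pi)))        # both close to 11.76
```

With $k$ close to zero the variance approaches the ZIP value $\lambda_i(1 - \pi_i)(1 + \pi_i\lambda_i)$, which is the sense in which the MZINB extends the MZIP.
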


4 Estimation 


Several estimation routes, such as pseudo-likelihood (Aerts et al, 2002; Molenberghs and Verbeke,
2005), generalized estimating equations (Zeger, Liang, and Albert, 1988), and Bayesian methodology,
are possible for estimating the parameters in the models. In this paper, parameters in the
models are estimated by maximum likelihood. This estimation procedure
obtains the set of parameter values that maximizes the marginal likelihood of the data.
The likelihood for the marginal zero-inflated Poisson (MZIP) model is written as:


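\[ L(\pi_i, \lambda_i \mid y_i) = \prod_{i=1}^{n} \left\{ (1 - \pi_i)\left[\frac{\pi_i}{1 - \pi_i} + e^{-\lambda_i}\right] \right\}^{I(y_i = 0)} \prod_{i=1}^{n} \left\{ (1 - \pi_i)\,\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \right\}^{I(y_i > 0)}. \tag{1} \]

Substituting $\pi_i = \text{expit}(Z_i'\alpha)$ and $\lambda_i = \exp(\delta_i) = (1 + \exp(Z_i'\alpha))\exp(X_i'\beta)$
into (1), the likelihood in terms of the parameters $\alpha$ and $\beta$ becomes

\[ L(\alpha, \beta \mid y_i) = \prod_{i=1}^{n} \left(1 + e^{Z_i'\alpha}\right)^{-1} \prod_{i=1}^{n} \left\{ e^{Z_i'\alpha} + \exp\left[-\left(1 + e^{Z_i'\alpha}\right)e^{X_i'\beta}\right] \right\}^{I(y_i = 0)} \]
\[ \qquad \times \prod_{i=1}^{n} \left\{ \exp\left[-\left(1 + e^{Z_i'\alpha}\right)e^{X_i'\beta}\right] \frac{\left(1 + e^{Z_i'\alpha}\right)^{y_i} e^{y_i X_i'\beta}}{y_i!} \right\}^{I(y_i > 0)}. \]
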
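For the extended version (MZINB) with overdispersion parameter $k$, the likelihood is given by

\[ L(\pi_i, \lambda_i, k \mid y_i) = \prod_{i=1}^{n} \left\{ (1 - \pi_i)\left[\frac{\pi_i}{1 - \pi_i} + \left(\frac{1}{1 + k\lambda_i}\right)^{1/k}\right] \right\}^{I(y_i = 0)} \]
\[ \qquad \times \prod_{i=1}^{n} \left\{ (1 - \pi_i)\,\frac{\Gamma(y_i + 1/k)}{y_i!\,\Gamma(1/k)} \left(\frac{k\lambda_i}{1 + k\lambda_i}\right)^{y_i} \left(\frac{1}{1 + k\lambda_i}\right)^{1/k} \right\}^{I(y_i > 0)}. \tag{2} \]

Substituting the expressions for $\pi_i$ and $\lambda_i$ into (2) yields

\[ L(\alpha, \beta, k \mid y_i) = \prod_{i=1}^{n} \left(1 + e^{Z_i'\alpha}\right)^{-1} \prod_{i=1}^{n} \left\{ e^{Z_i'\alpha} + p_i^{1/k} \right\}^{I(y_i = 0)} \times \prod_{i=1}^{n} \left\{ \frac{\Gamma(y_i + 1/k)}{y_i!\,\Gamma(1/k)} (1 - p_i)^{y_i}\,p_i^{1/k} \right\}^{I(y_i > 0)}, \]

where

\[ p_i = \frac{1}{1 + k\left(1 + \exp(Z_i'\alpha)\right)\exp(X_i'\beta)}. \]

The maximum likelihood estimates $\hat{\alpha}$, $\hat{\beta}$ and $\hat{k}$ are obtained
through numerical maximization. The asymptotic variance-covariance matrix can be derived from
the likelihood expression. We define the Hessian matrix of the mixed partial second derivatives of
the log-likelihood, $\ell$, by

\[ H(\eta) = \frac{\partial^2 \ell(\eta)}{\partial \eta_i\, \partial \eta_j}, \]

where $\eta = (\alpha, \beta, k)$. The Fisher information matrix is given by

\[ I(\eta) = -E(H(\eta)). \]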


The estimates can be obtained rather easily using the NLMIXED procedure of the SAS software package.
The procedure allows the user to specify a user-defined log-likelihood and it returns, in addition,
standard errors of the parameters. The standard errors are obtained as the square roots of the diagonal
elements of the inverse of the Fisher information matrix. Since the procedure handles all the numerical
details, the applied analyst can avoid deriving closed-form score equations and the Fisher information
matrix.
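
For readers who do not use SAS, the MZINB log-likelihood can be written in a few lines of general-purpose code and passed to any numerical optimizer. The following Python sketch is an illustration under stated assumptions, not the authors' NLMIXED program: the function name `mzinb_loglik` and the data layout (lists of covariate rows `X` and `Z`, each including an intercept) are hypothetical.

```python
import math

def mzinb_loglik(alpha, beta, k, y, X, Z):
    """Log-likelihood of the MZINB model for counts y with covariate rows X and Z."""
    total = 0.0
    for yi, xi, zi in zip(y, X, Z):
        eta_pi = sum(a * v for a, v in zip(alpha, zi))      # Z_i' alpha = logit(pi_i)
        eta_mu = sum(b * v for b, v in zip(beta, xi))       # X_i' beta = log(mu_i)
        lam = (1.0 + math.exp(eta_pi)) * math.exp(eta_mu)   # lambda_i = exp(delta_i)
        log_p = -math.log1p(k * lam)                        # log p_i, p_i = 1/(1 + k*lambda_i)
        total -= math.log1p(math.exp(eta_pi))               # log(1 - pi_i)
        if yi == 0:
            # zero count: exp(Z_i' alpha) + p_i^(1/k)
            total += math.log(math.exp(eta_pi) + math.exp(log_p / k))
        else:
            # positive count: negative binomial term with size 1/k
            total += (math.lgamma(yi + 1.0 / k) - math.lgamma(1.0 / k)
                      - math.lgamma(yi + 1)
                      + yi * math.log1p(-math.exp(log_p)) + log_p / k)
    return total

# Sanity check with illustrative parameter values: for one subject, the implied
# probabilities over y = 0, 1, 2, ... should sum to one.
alpha, beta, k = [0.6, -2.0], [0.25, 0.5], 1.5
total_prob = sum(math.exp(mzinb_loglik(alpha, beta, k, [y], [[1.0, 1.0]], [[1.0, 1.0]]))
                 for y in range(600))
print(total_prob)   # close to 1
```

Maximizing this function over $(\alpha, \beta, k)$, with $k$ constrained to be positive, and inverting the negative Hessian at the maximum would give estimates and standard errors analogous to those described above.
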

The fit of the models is assessed using -2 log-likelihood and the Akaike Information Criterion (AIC;
Akaike, 1974). The model with the minimum value for each of the criteria is often considered the
preferred or 'best' model. AIC is calculated using the formula AIC = -2 log-likelihood + 2p, where p
is the number of parameters in the model.
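
As a small worked example of the criterion, the sketch below computes AIC for two fitted models. The log-likelihood values and parameter counts are hypothetical numbers chosen for illustration; they are not results from the case studies.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2 * n_params

# Hypothetical fits: a smaller model and a larger model with one extra parameter.
aic_small = aic(log_likelihood=-1200.5, n_params=5)
aic_large = aic(log_likelihood=-1150.2, n_params=6)

# The model with the smaller AIC is preferred.
print(aic_small, aic_large, "larger model preferred:", aic_large < aic_small)
```
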


5 Analysis of Case Studies 

5.1 Analysis of the Epilepsy Data 

Six models are fitted to the data to compare their results. For each of the models, the dependent
variable, $Y_i$, is the number of epileptic seizures experienced by patient $i$, which follows either a
Poisson or negative binomial distribution. Treatment and time were treated as independent variables in
the count part of the models and only time in the logistic part of the zero-inflated models. The Poisson,
NB, MZIP and MZINB models can be viewed as 'marginal' models because they relate the marginal
mean $\mu_i$ to the independent variables, while in the ZIP and ZINB models the mean of the distribution
of the positive counts, $\lambda_i$, is regressed on the predictors. Thus, in the Poisson, NB, MZIP and
MZINB models,

\[ \log(\mu_i) = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Time}_i. \]

For the zero-inflated models, the logistic regression model is specified as:

\[ \text{logit}(\pi_i) = \alpha_0 + \alpha_1 \text{Time}_i. \]

The results of these models, parameter estimates and standard errors, are presented in the accompanying
table. Generally, all parameters in the logistic part of the zero-inflated models and the overdispersion
parameter of the negative binomial models were found to be significant. Except for the ZINB model, Time
was found to be significant in the count part of the models. Treatment was not significant
in the Poisson and NB models but was significant in the MZIP model. However, the improved MZINB model,
which acknowledges overdispersion, yielded a non-significant treatment effect, as did the Poisson and NB
models. For the ZIP and ZINB models, which have latent class interpretations, both Treatment and Time
were significant for the former and not significant for the latter. The model selection criteria,
log-likelihood and AIC, vary across the different models. Among the marginal models, the proposed MZINB
model yielded the highest likelihood and smallest AIC value. Therefore, the proposed model seems to
perform better than the rest of the marginal models and hence is to be preferred for making inference
and predictions.

5.2 Analysis of the Whitefly Data 

The outcome of this case study is the number of immature whiteflies, $Y_{ijk}$, for the $i$th treatment
in the $j$th block measured at the $k$th week, and the independent variables are Treatment, Block and
Week. The marginal mean model is given by:

\[ \log(\mu_{ijk}) = \beta_0 + \text{Block}_j + \text{Treatment}_i + \beta\,\text{Week}_k. \]

For the ZIP and ZINB models, $\lambda_{ijk}$ rather than $\mu_{ijk}$ is related to predictors. The
probability of zero counts is modeled by:

\[ \text{logit}(\pi_{ijk}) = \alpha_0 + \alpha_1 \text{Week}_k. \]

Week is treated as continuous and the other terms represent factor effects. Parameter estimates and
standard errors of the fitted models are presented in the accompanying table. All treatment levels and
the effect of week were found to be significant in all the models. In addition, all parameters in the
zero-inflated parts were also significant. The effects of Block1 and Block2 were not significant in any
of the negative binomial models, whereas only Block2 had a significant effect in the other models. Here
again, in terms of inference, the proposed model results in slightly different parameter estimates and
standard errors from the MZIP model. Although most of the parameters were significant, none of the
block effects was significant in the broader model, which properly accounts for overdispersion and
excess zero counts.

It is observed from the model selection criteria that extending the MZIP model by allowing
overdispersion improved the model fit significantly (smallest AIC and highest likelihood). This again
highlights the importance of acknowledging overdispersion in the model, which can lead to better
inference about the effect of independent variables on the response.

6 Simulation Study 

Simulations are carried out in this section to study some properties of the proposed method and
how it compares with the Poisson, negative binomial, and marginal zero-inflated Poisson models.
Large datasets were generated from the MZIP and MZINB models under different settings. These
are examined in turn.

6.1 Data generated from the marginal zero-inflated model

This part of the simulation study is aimed at investigating the performance of the models, particularly
the MZINB, when data are overdispersed only due to the presence of excess zero counts. The simulation
utilized the following models:

\[ \text{logit}(\pi_i) = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2}, \]

\[ \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}, \]

where $i = 1, \ldots, n$, $x_{i1} \sim \text{Bernoulli}(0.5)$ and $x_{i2}$ follows a standard lognormal
distribution. Zero-inflated counts were generated with the set of true parameters

\[ (\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_1, \beta_2) = (0.6, -2, 0.25, 0.25, 0.4, 0.25). \]

For each sample size $n = (100, 500, 1000)$, 2000 datasets were generated from the marginal ZIP
model and analyzed using the four models. Summary quantities (mean, standard errors, simulation-based
standard errors, bias, relative bias and mean square errors (MSE)) are reported in Table 3
and Table 4. Generally, an increase in sample size reduces the bias and MSE of the parameter estimates
in all four models. The Poisson regression, followed by the NB model, is the worst performer since
both yielded large bias and MSE, as depicted in the corresponding bias and MSE figures. The MZINB model
is slightly biased compared to the MZIP model, but this is compensated by an increase in precision,
resulting in smaller MSE than that of the MZIP model. The bias is not surprising given that the
data were generated from the MZIP model. However, the broader MZINB model is able to precisely
estimate the parameters as it also addresses the inflation of zeros, and thus produces a smaller measure
of the overall variability.
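
The summary quantities reported in the simulation tables can be computed from the replicate estimates as in the following sketch. It assumes that the MSE is obtained as squared bias plus the squared simulation-based standard error, a decomposition that reproduces the Table 3 entries to rounding; the function name `summarize` and the toy estimates are hypothetical.

```python
import statistics

def summarize(estimates, true_value):
    """Mean estimate, simulation-based SE, bias, relative bias and MSE over replicates."""
    mean_est = statistics.fmean(estimates)
    sb_se = statistics.stdev(estimates)      # Monte Carlo standard deviation of the estimates
    bias = mean_est - true_value
    return {
        "estimate": mean_est,
        "sb_std_err": sb_se,
        "bias": bias,
        "rel_bias": bias / true_value,
        "mse": bias ** 2 + sb_se ** 2,       # assumed decomposition: bias^2 + variance
    }

# Toy replicate estimates of a parameter whose true value is 0.25 (illustrative only)
summary = summarize([0.3, 0.4, 0.5], true_value=0.25)
print(summary)
```

For instance, for the Poisson intercept at n = 100 in Table 3, the bias of 0.1163 and the simulation-based standard error of 0.4002 give 0.1163^2 + 0.4002^2 ≈ 0.1737, the reported MSE, and 0.1163 / 0.25 ≈ 0.465, the reported relative bias up to rounding.
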


6.2 Data generated from the marginal zero-inflated negative binomial model


The predictors used in this set of simulations are similar to those used in Section 6.1. To
generate data from the marginal ZINB model, an additional parameter $k$ is required. The true
parameter values are slightly modified,

\[ (\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_1, \beta_2) = (0.6, -2, 0.3, 0.25, 0.5, 0.2). \]

The impact of different levels of overdispersion is assessed using different values of the overdispersion
parameter, $k = 1.5, 2.5, 4$. For each value of $k$, 2000 simulated datasets were generated from the
marginal ZINB model for different sample sizes, $n = (100, 200, 500, 1000)$, and each of the four
models was fitted. Simulation results for the Poisson and negative binomial models are presented in
Table 5 and the two tables that follow it, and those for the MZIP and MZINB models in three further
tables, for the different values of $k$ respectively. Graphs of bias and MSE against sample size are
depicted in the corresponding figures. From these results, it is observed that bias and MSE generally
decrease with increasing sample size. Notably, the MSE increases with the overdispersion
parameter, but this effect diminishes with increasing sample size for all the models. The overdispersion
parameter is poorly estimated by the NB model but better estimated by the proposed MZINB model,
with further improvements as sample size increases. This is due to the excess zeros, which are ignored
by the negative binomial model but accounted for in the MZINB. Both the Poisson model and the
negative binomial model fit poorly, which is evident in the wide discrepancy between the model-based
standard errors of parameters and the Monte Carlo based standard errors, and in the large bias and MSE.
The marginal ZINB model performs better than the rest of the models in terms of yielding the smallest
bias as well as MSE for model parameters.


7 Concluding Remarks 

It is commonly known that the Poisson regression is overly restrictive because of its mean-variance
relationship and its inability to capture extra dispersion due to excess zeros. The ZIP and ZINB models
are useful extensions but give different interpretations of model parameters than the base models:
they have latent class rather than marginal interpretations. This paper has proposed an extension to
the marginal ZIP model to address overdispersion. It has been shown, through the analysis of two
case studies, that the extended model helps improve model fit significantly and can help in drawing
valid inference. Through simulation studies, it has been shown that even when data are generated
from the MZIP model, the MZINB model tends to yield small MSE and bias. The MZIP model does
worse when data are highly overdispersed.

The proposed model, due to its generality, can also be used to test the adequacy of the MZIP model,
i.e. by comparing the MZIP to the MZINB, we can test whether or not it is sufficient to use the
MZIP model. If the data do not exhibit overdispersion, i.e. $k = 0$, then the variance of the MZINB
model reduces to that of the MZIP model. The marginal zero-inflated models are not a replacement for
the traditional ZIP and ZINB models, as the choice between latent class and marginal models will depend
on the research question. If the aim is population-average inference about the
effect of a variable or treatment, then it is easier and safer to begin with the proposed technique.

It is worth noting that the proposed methodology is applicable to univariate data. Extension to
correlated data such as longitudinal, repeated measures or clustered data may be required. One needs 
to be careful again with interpretation of model parameters when random effects are introduced in 
the proposed technique to accommodate association inherent in the data. Such an introduction 
will result in subject-specific interpretation instead of population average interpretation. Special 
techniques will be needed to obtain marginal inference in the presence of subject-specific random 
effects. 
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Figure 2: Whitefly Data. Means and standard deviations by time (panel 1) and block (panel 2). 
Table 3: Results of the Poisson and Negative Binomial Model based on 2000 Simulations from the MZIP. 

                                      Poisson                   Negative Binomial
Sample size  Measure              β0       β1       β2       β0       β1       β2        k
True parameters                 0.25      0.4     0.25     0.25      0.4     0.25        -
100          Estimate         0.3663   0.5049   0.1133   0.3375   0.4925   0.1296   2.5710
             Std. error       0.1235   0.1371   0.0288   0.3118   0.3588   0.1148   0.5725
             SB std. err.     0.4002   0.4372   0.1336   0.3319   0.3435   0.1673   0.6474
             Bias             0.1163   0.1049  -0.1367   0.0875   0.0925  -0.1204   1.5710
             Rel. bias        0.4651   0.2622  -0.5467   0.3500   0.2313  -0.4816   1.5710
             MSE              0.1737   0.2021   0.0365   0.1178   0.1265   0.0425   2.8872
500          Estimate         0.4496   0.4744   0.1082   0.3338   0.4316   0.1765   2.6976
             Std. error       0.0501   0.0583   0.0086   0.1376   0.1601   0.0473   0.2608
             SB std. err.     0.1904   0.2066   0.0602   0.1471   0.1457   0.0736   0.2869
             Bias             0.1996   0.0744  -0.1418   0.0838   0.0316  -0.0735   1.6976
             Rel. bias        0.7985   0.1860  -0.5673   0.3354   0.0790  -0.2941   1.6976
             MSE              0.0761   0.0482   0.0237   0.0287   0.0222   0.0108   2.9642
1000         Estimate         0.4727   0.4718   0.1012   0.3300   0.4238   0.1839   2.7214
             Std. error       0.0347   0.0409   0.0053   0.0972   0.1132   0.0329   0.1852
             SB std. err.     0.1417   0.1515   0.0427   0.1031   0.1070   0.0488   0.1981
             Bias             0.2227   0.0718  -0.1488   0.0800   0.0238  -0.0661   1.7214
             Rel. bias        0.8906   0.1795  -0.5952   0.3200   0.0594  -0.2645   1.7214
             MSE              0.0697   0.0281   0.0240   0.0170   0.0120   0.0068   3.0025

Table 4: Results of the MZIP and MZINB based on 2000 Simulations from the MZIP. 

Table 5: Results of the Poisson and Negative Binomial Model based on 2000 Simulations from the MZINB with k = 1.5. 

                                      Poisson                   Negative Binomial
Sample size  Measure              β0       β1       β2       β0       β1       β2        k
True parameters                 0.25      0.5      0.2     0.25      0.5      0.2      1.5
100          Estimate         0.3092   0.6298   0.0294   0.2857   0.7024   0.0106   5.4097
             Std. error       0.1378   0.1509   0.0381   0.4440   0.5114   0.1787   1.2321
             SB std. err.     0.5713   0.6114   0.1779   0.5488   0.5846   0.2505   1.1858
             Bias             0.0592   0.1298  -0.1706   0.0357   0.2024  -0.1894   3.9097
             Rel. bias        0.2369   0.2597  -0.8531   0.1426   0.4048  -0.9471   2.6064
             MSE              0.3299   0.3907   0.0608   0.3025   0.3827   0.0986  16.6919
200          Estimate         0.3613   0.5897   0.0486   0.3283   0.6104   0.0565   5.7296
             Std. error       0.0905   0.1011   0.0221   0.3123   0.3636   0.1183   0.9090
             SB std. err.     0.3743   0.4243   0.1114   0.3609   0.3878   0.1741   0.9412
             Bias             0.1113   0.0897  -0.1514   0.0783   0.1104  -0.1435   4.2296
             Rel. bias        0.4451   0.1793  -0.7569   0.3131   0.2208  -0.7174   2.8197
             MSE              0.1525   0.1881   0.0353   0.1364   0.1626   0.0509  18.7754
500          Estimate         0.3994   0.5702   0.0564   0.3456   0.5693   0.0832   5.8368
             Std. error       0.0544   0.0620   0.0116   0.1955   0.2281   0.0708   0.5792
             SB std. err.     0.2359   0.2736   0.0639   0.2238   0.2413   0.1060   0.6369
             Bias             0.1494   0.0702  -0.1436   0.0956   0.0693  -0.1168   4.3368
             Rel. bias        0.5977   0.1405  -0.7182   0.3825   0.1385  -0.5838   2.8912
             MSE              0.0780   0.0798   0.0247   0.0592   0.0630   0.0249  19.2135
1000         Estimate         0.4094   0.5656   0.0601   0.3477   0.5498   0.0979   5.8947
             Std. error       0.0376   0.0434   0.0073   0.1376   0.1610   0.0490   0.4117
             SB std. err.     0.1673   0.1956   0.0436   0.1553   0.1751   0.0735   0.4538
             Bias             0.1594   0.0656  -0.1399   0.0977   0.0498  -0.1021   4.3947
             Rel. bias        0.6377   0.1313  -0.6996   0.3908   0.0997  -0.5106   2.9298
             MSE              0.0534   0.0426   0.0215   0.0337   0.0331   0.0158  19.5193

Table 6: Results of the Poisson and Negative Binomial Model based on 2000 Simulations from the MZINB with k = 2.5. 

                                      Poisson                   Negative Binomial
Sample size  Measure              β0       β1       β2       β0       β1       β2        k
True parameters                 0.25      0.5      0.2     0.25      0.5      0.2      2.5
100          Estimate         0.2575   0.7050   0.0010   0.2875   0.7876  -0.0375   6.5425
             Std. error       0.1463   0.1587   0.0421   0.4939   0.5651   0.2042   1.5199
             SB std. err.     0.7074   0.7483   0.2157   0.6946   0.7131   0.2896   1.1279
             Bias             0.0075   0.2050  -0.1990   0.0375   0.2876  -0.2375   4.0425
             Rel. bias        0.0299   0.4100  -0.9952   0.1501   0.5752  -1.1876   1.6170
             MSE              0.5005   0.6020   0.0861   0.4839   0.5912   0.1403  17.6140
200          Estimate         0.3430   0.6371   0.0289   0.3356   0.6353   0.0268   7.7645
             Std. error       0.0934   0.1035   0.0237   0.3639   0.4211   0.1396   1.2892
             SB std. err.     0.4526   0.5034   0.1313   0.4388   0.4694   0.2003   1.2649
             Bias             0.0930   0.1371  -0.1711   0.0856   0.1353  -0.1732   5.2645
             Rel. bias        0.3719   0.2741  -0.8555   0.3423   0.2707  -0.8660   2.1058
             MSE              0.2135   0.2722   0.0465   0.1999   0.2386   0.0701  29.3149
500          Estimate         0.3855   0.6003   0.0454   0.3461   0.5912   0.0616   7.9611
             Std. error       0.0556   0.0628   0.0124   0.2269   0.2646   0.0835   0.8285
             SB std. err.     0.2683   0.2998   0.0718   0.2534   0.2833   0.1194   0.8805
             Bias             0.1355   0.1003  -0.1546   0.0961   0.0912  -0.1384   5.4611
             Rel. bias        0.5419   0.2006  -0.7729   0.3845   0.1825  -0.6919   2.1844
             MSE              0.0903   0.0999   0.0291   0.0734   0.0886   0.0334  30.5989
1000         Estimate         0.3916   0.5978   0.0526   0.3491   0.5738   0.0765   8.0249
             Std. error       0.0384   0.0439   0.0079   0.1597   0.1865   0.0579   0.5880
             SB std. err.     0.1971   0.2193   0.0470   0.1812   0.2011   0.0826   0.6328
             Bias             0.1416   0.0978  -0.1474   0.0991   0.0738  -0.1235   5.5249
             Rel. bias        0.5664   0.1957  -0.7369   0.3963   0.1476  -0.6176   2.2100
             MSE              0.0589   0.0577   0.0239   0.0427   0.0459   0.0221  30.9250

Table 7: Results of the Poisson and Negative Binomial Model based on 2000 Simulations from the MZINB with k = 4.0. 

                                      Poisson                   Negative Binomial
Sample size  Measure              β0       β1       β2       β0       β1       β2        k
True parameters                 0.25      0.5      0.2     0.25      0.5      0.2      4.0
100          Estimate         0.1684   0.7823  -0.0203   0.1770   1.0371  -0.1067   7.1537
             Std. error       0.1577   0.1700   0.0456   0.5262   0.5997   0.2233   1.7016
             SB std. err.     0.9353   0.9822   0.2331   0.9783   0.9851   0.3289   1.0319
             Bias            -0.0816   0.2823  -0.2203  -0.0730   0.5371  -0.3067   3.1537
             Rel. bias       -0.3262   0.5646  -1.1015  -0.2920   1.0743  -1.5337   0.7884
             MSE              0.8814   1.0444   0.1029   0.9624   1.2589   0.2022  11.0106
200          Estimate         0.3179   0.6659   0.0082   0.3142   0.7655  -0.0251   9.5471
             Std. error       0.0968   0.1071   0.0255   0.4059   0.4685   0.1622   1.6513
             SB std. err.     0.5230   0.5784   0.1370   0.5683   0.5915   0.2319   1.2085
             Bias             0.0679   0.1659  -0.1918   0.0642   0.2655  -0.2251   5.5471
             Rel. bias        0.2715   0.3319  -0.9589   0.2569   0.5310  -1.1257   1.3868
             MSE              0.2781   0.3621   0.0556   0.3271   0.4204   0.1044  32.2308
500          Estimate         0.3800   0.6090   0.0266   0.3693   0.5982   0.0352  11.0496
             Std. error       0.0568   0.0641   0.0135   0.2663   0.3101   0.1012   1.2283
             SB std. err.     0.3088   0.3516   0.0770   0.3066   0.3316   0.1382   1.3245
             Bias             0.1300   0.1090  -0.1734   0.1193   0.0982  -0.1648   7.0496
             Rel. bias        0.5202   0.2179  -0.8672   0.4770   0.1964  -0.8241   1.7624
             MSE              0.1123   0.1355   0.0360   0.1082   0.1196   0.0463  51.4512
1000         Estimate         0.4016   0.5994   0.0339   0.3802   0.5726   0.0530  11.1191
             Std. error       0.0390   0.0444   0.0086   0.1867   0.2181   0.0692   0.8688
             SB std. err.     0.2124   0.2429   0.0501   0.2219   0.2387   0.0926   0.9441
             Bias             0.1516   0.0994  -0.1661   0.1302   0.0726  -0.1470   7.1191
             Rel. bias        0.6062   0.1988  -0.8305   0.5209   0.1452  -0.7350   1.7798
             MSE              0.0681   0.0689   0.0301   0.0662   0.0622   0.0302  51.5729

Table 8: Results of the MZIP and MZINB based on 2000 Simulations from the MZINB with k 

Table 9: Results of the MZIP and MZINB based on 2000 Simulations from the MZINB with k = 2.5. 

Table 10: Results of the MZIP and MZINB based on 2000 Simulations from the MZINB with k = 4.0. 

Figure 4: Plot of MSE against sample size for all models (Data generated from MZIP) 

Figure 5: Plot of bias against sample size for all models (Data generated from MZIP) 

Figure 6: Plot of MSE against sample size for all models (Data generated from MZINB) 

Figure 7: Plot of bias against sample size for all models (Data generated from MZINB) 